Michael Espero
2020-01-28
@michaelespero
I'm very happy to take part in the R Community.
What did I learn last year from the R Community? (R-User Groups & satRday)
- parApply() helps make loops run faster.
- The testthat package can help us make unit testing more common.
- TensorFlow works well with R and Python.
- Some prefer dplyr and others prefer data.table; dtplyr bridges the two.
- tidytext allows us to pipe common NLP tasks.
- Explicitly anticipate interoperability challenges.
How might our teams get tripped up in setup?
# We have the people who made the pacman package to thank for this one.
library(pacman)
# This is a wrapper that tries to load what you ask of it and attempts to install what's not available.
p_load(tidyverse, janitor, report, ggthemes)
# Load the reticulate package like any other package.
library(reticulate)
# Specify the location of the version of Python you'd like to use.
use_python("/Users/michaelespero/opt/anaconda3/bin/python")
# The initialize argument attempts to start Python bindings if they aren't already available.
py_available(initialize = T)
[1] TRUE
# Specify the name of the module you wish to check on in quotes.
py_module_available(module = "pandas")
[1] TRUE
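reticulate runs this check from the R side; the same idea (is a module importable?) can be sketched on the Python side with the standard-library importlib, without actually importing the module:

```python
import importlib.util

def module_available(name):
    """Return True if the named top-level module can be imported."""
    # find_spec() locates a module without executing it; None means not found.
    return importlib.util.find_spec(name) is not None

print(module_available("json"))                    # stdlib module, present
print(module_available("definitely_not_a_module_xyz"))
```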
# Detect some information about your setup
py_config()
python: /Users/michaelespero/opt/anaconda3/bin/python
libpython: /Users/michaelespero/opt/anaconda3/lib/libpython3.7m.dylib
pythonhome: /Users/michaelespero/opt/anaconda3:/Users/michaelespero/opt/anaconda3
version: 3.7.4 (default, Aug 13 2019, 15:17:50) [Clang 4.0.1 (tags/RELEASE_401/final)]
numpy: /Users/michaelespero/opt/anaconda3/lib/python3.7/site-packages/numpy
numpy_version: 1.17.2
python versions found:
/Users/michaelespero/opt/anaconda3/bin/python
/Users/michaelespero/.virtualenvs/r-reticulate/bin/python
/usr/bin/python
/usr/bin/python3
/usr/local/bin/python3
/Users/michaelespero/anaconda3/bin/python
/Users/michaelespero/.virtualenvs/py3-virtualenv/bin/python
/Users/michaelespero/.virtualenvs/test-v37/bin/python
/Users/michaelespero/anaconda3/envs/r-tensorflow/bin/python
/Users/michaelespero/anaconda3/envs/spacy_condaenv/bin/python
/Users/michaelespero/anaconda3/envs/two/bin/python
# The import() function allows you to name an R object with access to a Python module's functions using $.
pd <- import("pandas")
# pd$
Python packages may be installed from the R console, for example with reticulate::py_install().
# Let's make a simple function in Python that reads a csv file, selects only rows with organic produce, and selects 5 specific columns by name.
import pandas as pd

def make_guac(file):
    guac2 = pd.read_csv(file)
    guac2 = guac2[guac2["type"] == "organic"]
    guac2 = guac2[["Date", "year", "type", "AveragePrice", "region"]]
    return guac2
source_python("guac.py")
guac2 <- make_guac("avocado.csv")
head(guac2)
Date year type AveragePrice region
9126 2015-12-27 2015 organic 1.83 Albany
9127 2015-12-20 2015 organic 1.89 Albany
9128 2015-12-13 2015 organic 1.85 Albany
9129 2015-12-06 2015 organic 1.84 Albany
9130 2015-11-29 2015 organic 1.94 Albany
9131 2015-11-22 2015 organic 1.94 Albany
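The filter-then-select pattern inside make_guac() can be sketched on a tiny in-memory frame. The column names mirror the avocado data, but the rows below are invented for illustration:

```python
import pandas as pd

# Toy rows echoing the avocado data's columns (values are made up).
df = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2015-12-13"],
    "year": [2015, 2015, 2015],
    "type": ["organic", "conventional", "organic"],
    "AveragePrice": [1.83, 1.35, 1.85],
    "region": ["Albany", "Albany", "Albany"],
})

# Keep only organic rows, then select the five named columns.
organic = df[df["type"] == "organic"]
organic = organic[["Date", "year", "type", "AveragePrice", "region"]]
print(organic)
```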
library(readr)
# Let's use read_csv() to read in "tweets.csv", a data file containing a selection of tweets on the subject of wisdom. We'll save it as df_r to indicate that it's an R dataframe. Here's the data source: https://www.kaggle.com/hsankesara/the-tweets-of-wisdom
df_r <- read_csv("tweets.csv")
library(tidyverse)
# glimpse() gives us a nice peek of our dataframe complete with its dimensions (number of rows and columns), column names, column types, and a little of the contents of each of the columns.
glimpse(df_r)
Observations: 31,115
Variables: 6
$ author_name <chr> "Naval", "Naval", "Naval", "Naval", "Naval", "Naval", "…
$ created_at <dttm> 2019-08-07 22:36:56, 2019-08-07 05:00:38, 2019-08-07 0…
$ handle <chr> "naval", "naval", "naval", "naval", "naval", "naval", "…
$ likes <dbl> 7566, 21886, 6462, 466, 3971, 6141, 15681, 1633, 510, 2…
$ retweets <dbl> 1498, 5984, 1266, 61, 906, 1114, 4805, 153, 126, 690, 7…
$ tweet_content <chr> "Unresolved thoughts, prematurely pushed out of the min…
# The "tweet_content" column contains character vectors with the text of each tweet. R begins row indexing with 1, so specifying tweet content from the 1st row returns the first tweet in df_r.
df_r$tweet_content[1]
[1] "Unresolved thoughts, prematurely pushed out of the mind, pile up in an internal landfill - which eventually pokes out of the subconscious and manifests as chronic, nonspecific anxiety"
# Let's get the author name along with the tweet to show who posted this opinion.
df_r[1,] %>%
select(c("author_name", "tweet_content"))
# A tibble: 1 x 2
author_name tweet_content
<chr> <chr>
1 Naval Unresolved thoughts, prematurely pushed out of the mind, pile up …
# Now that we have an idea of the first row of the tweet data in R, let's convert it to Pandas dataframe for use in python.
df_py <- r_to_py(df_r)
# While the previous code chunk was an R chunk, we can write Python in this one by naming the language in the chunk header's curly brackets (e.g. {python} instead of {r}).
# In python, instead of using library() to load packages, we can use "import" to load modules.
import numpy as np
import pandas as pd
import nltk
# In python, we can get the dimensions of a dataframe with ".shape" added after the data object. Notice we're accessing the Pandas dataframe we made in the previous R code chunk by using r.[pandas_df_name].
r.df_py.shape
(31115, 6)
# We can get the column names with ".columns" following our r.df_py Pandas dataframe from the above R code chunk.
r.df_py.columns
Index(['author_name', 'created_at', 'handle', 'likes', 'retweets',
'tweet_content'],
dtype='object')
# In the R version of the tweet data we checked the content of the first row's tweet content. Recall, it was an opinion regarding unresolved thoughts and anxiety. Let's check the author name and the tweet content for the first row of the Pandas version, this time using zero indexing to get to the first row of a dataframe in python.
r.df_py.author_name[0]
'Naval'
r.df_py.tweet_content[0]
'Unresolved thoughts, prematurely pushed out of the mind, pile up in an internal landfill - which eventually pokes out of the subconscious and manifests as chronic, nonspecific anxiety'
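The one-based (R) versus zero-based (Python) indexing difference can be sketched with a toy two-row frame (the contents below are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "author_name": ["Naval", "Mike Tyson"],
    "tweet_content": ["first tweet", "second tweet"],
})

# pandas is zero-based: .iloc[0] is the first row by position,
# where R's df[1, ] would be.
first_author = df["author_name"].iloc[0]
print(first_author)  # Naval
```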
# We're back in an R chunk; let's check on our wisdom-tweets dataframes.
# First the pandas version.
df_py
author_name ... tweet_content
0 Naval ... Unresolved thoughts, prematurely pushed out of...
1 Naval ... The modern mind is overstimulated and the mode...
2 Naval ... The Lindy Effect for startups:\n\nThe longer y...
3 Naval ... @orangebook_ This was a good tweet.
4 Naval ... Social media lowers the cost of raising & ...
... ... ... ...
31110 Uncanny Insights ... Our general behavior is the reflection of our ...
31111 Uncanny Insights ... In every matter being an unorthodox is a sure ...
31112 Uncanny Insights ... You can change your way of thinking, by changi...
31113 Uncanny Insights ... People fear that they will be dispossessed if ...
31114 Uncanny Insights ... To diagnose your thoughts solitude is the firs...
[31115 rows x 6 columns]
# Let's see how many unique authors (of tweets) are in the dataframe.
unique(df_r$author_name) %>% length
[1] 2782
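A pandas counterpart to length(unique(...)) is Series.nunique(); a minimal sketch on invented author names:

```python
import pandas as pd

df = pd.DataFrame({"author_name": ["Naval", "Naval", "Mike Tyson", "Bill Dixon"]})

# nunique() counts distinct values, mirroring unique(...) %>% length in R.
n_authors = df["author_name"].nunique()
print(n_authors)  # 3
```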
# The second argument to top_n() is the number of rows to keep, ranked by the variable given as the third argument.
df_r_top5 <- top_n(df_r, 5, likes)
library(ggthemes)
# Aside from the standard aesthetic arguments, we set our labels, rotate our x-axis labels, and let the Edward Tufte theme do the styling. Note that theme_tufte() is a complete theme, so it must come before theme() or it will overwrite our axis-text tweaks.
p1 <- df_r_top5 %>%
  ggplot(aes(x = author_name, y = likes, color = retweets)) +
  geom_col() +
  labs(x = "Author", y = "Likes", col = "Retweets") +
  theme_tufte() +
  theme(axis.text.x = element_text(face = "bold", color = "black", size = 8, angle = 45)) +
  scale_color_viridis_c()
p1
# desc refers to descending order by the variable you pass to it.
df_r %>%
arrange(desc(likes))
# A tibble: 31,115 x 6
author_name created_at handle likes retweets tweet_content
<chr> <dttm> <chr> <dbl> <dbl> <chr>
1 abolish ice … 2019-08-10 20:16:00 hashta… 1.35e6 424270 "oh my god, i need…
2 Mike Tyson 2019-01-16 22:19:41 MikeTy… 7.92e5 172538 "Stop sending me t…
3 Dominique Ap… 2019-04-19 16:27:58 Apollo… 5.48e5 104471 "It's taken me 45 …
4 Igbo Excelle… 2019-08-31 10:40:27 1ncogn… 4.57e5 146526 "I noticed African…
5 Igbo Excelle… 2019-08-31 10:40:27 1ncogn… 4.57e5 146526 "I noticed African…
6 Mark Phillips 2019-08-20 21:39:16 Suprem… 3.22e5 128305 "How Chick Fila Wo…
7 Bill Dixon 2018-12-09 18:54:56 BillDi… 3.17e5 43219 "About 5 years ago…
8 Fernando De … 2019-09-11 18:50:11 DeJesu… 3.11e5 65422 "Bro we’re in fuck…
9 Mike Drucker 2019-06-07 18:46:26 MikeDr… 3.06e5 69278 "Twitter is fun be…
10 Koopington F… 2018-12-29 04:50:40 koopa_… 2.70e5 90209 "This the most Flo…
# … with 31,105 more rows
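For readers following along in Python, sort_values() and nlargest() play roughly the roles of arrange(desc(...)) and top_n(); a sketch on a toy frame (values invented):

```python
import pandas as pd

df = pd.DataFrame({
    "author_name": ["a", "b", "c", "d"],
    "likes": [10, 400, 25, 300],
})

# sort_values(ascending=False) ~ arrange(desc(likes)) in dplyr.
ranked = df.sort_values("likes", ascending=False)

# nlargest(n, column) ~ top_n(df, n, column).
top2 = df.nlargest(2, "likes")

print(ranked["likes"].tolist())        # [400, 300, 25, 10]
print(top2["author_name"].tolist())    # ['b', 'd']
```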
# You might like to use = instead of <- in R for assignment because = works in Python too.
guac <- read_csv("avocado.csv")
glimpse(guac)
Observations: 18,249
Variables: 14
$ X1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
$ Date <date> 2015-12-27, 2015-12-20, 2015-12-13, 2015-12-06, 2015-…
$ AveragePrice <dbl> 1.33, 1.35, 0.93, 1.08, 1.28, 1.26, 0.99, 0.98, 1.02, …
$ `Total Volume` <dbl> 64236.62, 54876.98, 118220.22, 78992.15, 51039.60, 559…
$ `4046` <dbl> 1036.74, 674.28, 794.70, 1132.00, 941.48, 1184.27, 136…
$ `4225` <dbl> 54454.85, 44638.81, 109149.67, 71976.41, 43838.39, 480…
$ `4770` <dbl> 48.16, 58.33, 130.50, 72.58, 75.78, 43.61, 93.26, 80.0…
$ `Total Bags` <dbl> 8696.87, 9505.56, 8145.35, 5811.16, 6183.95, 6683.91, …
$ `Small Bags` <dbl> 8603.62, 9408.07, 8042.21, 5677.40, 5986.26, 6556.47, …
$ `Large Bags` <dbl> 93.25, 97.49, 103.14, 133.76, 197.69, 127.44, 122.05, …
$ `XLarge Bags` <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, …
$ type <chr> "conventional", "conventional", "conventional", "conve…
$ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, …
$ region <chr> "Albany", "Albany", "Albany", "Albany", "Albany", "Alb…
library(janitor)
# clean_names() gives us standard lowercase feature names
guac = guac %>%
clean_names() %>%
select(c("date", "year", "type", "average_price", "total_bags"))
names(guac)
[1] "date" "year" "type" "average_price"
[5] "total_bags"
# The type of avocado was encoded as a character vector. Let's make it a factor.
guac$type <- as_factor(guac$type)
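If you were doing the same cleanup in pandas, lowercase-underscore column names can be produced with a small rename, and the categorical dtype is pandas' analogue of an R factor (the two-row frame below is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Average Price": [1.33, 1.35],
    "Type": ["conventional", "organic"],
})

# Lowercase-with-underscores column names, like janitor::clean_names().
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Categorical dtype plays the role of an R factor.
df["type"] = df["type"].astype("category")

print(list(df.columns))  # ['average_price', 'type']
print(df["type"].dtype)  # category
```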
library(lubridate)
# Notice that we're manipulating our x-axis, date, to be shown by week.
p2 <- ggplot(guac, aes(x = week(date), y = average_price, color = type)) +
geom_jitter() +
geom_smooth(method = "lm", se = FALSE) +
theme_tufte() +
labs(
title = "Is the Price of Guac Rising?",
x = "Week",
y = "Average Price",
caption = "Data Source: https://www.kaggle.com/neuromusic/avocado-prices")
p2
import seaborn as sns
# Plot r.guac using Seaborn
c_py_plot = sns.regplot(x = "year", y = "average_price", data = r.guac)
c_py_plot.set(xlabel = "Year", ylabel = "Average Price", title = "Is Guac Getting Richer?")
[Text(0, 0.5, 'Average Price'), Text(0.5, 0, 'Year'), Text(0.5, 1.0, 'Is Guac Getting Richer?')]
c_py_plot
<matplotlib.axes._subplots.AxesSubplot object at 0x122675f90>
# Let's bring pre-processed corpora into the environment for easy access. See https://www.nltk.org/book/ch01.html
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
# If we call the object "text8" we can see its title. It appears to be a collection of personal ads.
text8
<Text: Personals Corpus>
# Since this data is about people seeking a mate, let's check the concordance for the word "love". It'll return all instances of the word "love" along with a bit of context to see how it's used.
text8.concordance("love")
Displaying 10 of 10 matches:
od sense of humour , am romantic and love drives , fishing , camping and music
ives , fishing , camping and music . Love my 2 kids . Am looking for a lady wi
hip ASIAN LADY sought . Kids OK as I love family . Nice & honest guy . ASIAN L
ie mid 40s b / man f / ship r / ship LOVE to meet widowed lady over 50 , no ch
late , intelligent & very flexible . Love a good laugh , love life & enjoy con
very flexible . Love a good laugh , love life & enjoy contrasts and the finer
er callers welcome to reply . DO YOU LOVE TO DANCE ? Fun loving and employed ,
0s , working full time . Looking for love & laughter . Are you at least 59 , n
t not fanatical ? A bonus would be a love of dancing . GARDEN LOVER Hi ! I am
dinners , fine wine , romance & true love . YORKE PENINSULA LADY Late 70s , 53
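NLTK's concordance() does this with its own tokenization and display; the core idea can be sketched in plain Python (a toy token list, not NLTK's implementation):

```python
# A minimal concordance: show each occurrence of a word with a window of context.
def concordance(tokens, word, width=3):
    word = word.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append(f"{left} [{tok}] {right}")
    return hits

# Toy token list echoing the style of the personals corpus.
tokens = "I love a good laugh , love life & enjoy the finer things".split()
for line in concordance(tokens, "love"):
    print(line)
```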
# We can search the personals corpus for words thought to be similar to the word "meet", for instance. Here, similar refers to words appearing in common contexts with the target word.
text8.similar("meet")
relationship australian share beginning meeting find very
text8.dispersion_plot(["love", "looking", "fun", "true"])